
Conversation

LukeAVanDrie
Contributor

@LukeAVanDrie LukeAVanDrie commented Sep 20, 2025

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

This PR addresses several critical bugs in the endpoint picker that were discovered during refactoring work. These fixes are essential for stability, reliability, and observability.

The key changes are:

  1. Fixes Critical Streaming Metrics Bug: For text/event-stream responses, token usage metrics are being undercounted. The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages. Furthermore, a context corruption bug caused the model_name label in Prometheus to be empty, rendering the metric unusable for per-model monitoring. This PR corrects the logic to accumulate tokens across the entire stream and ensures all metric labels are correctly populated.

  2. Fixes Header Parsing Reliability: The EPP was brittle in how it parsed headers, exclusively checking the RawValue field and ignoring the Value field. This could lead to two severe production issues:

    • Streaming Detection Failure: If a client sent content-type: text/event-stream using the Value field, the EPP would fail to detect the stream and attempt to buffer the entire response, risking high memory usage and OOM kills.
    • Broken Distributed Tracing: The x-request-id header could be missed, breaking the distributed trace and hindering debugging.
      This PR makes all header parsing robust by correctly checking both fields.
  3. Complete Hermetic Test Overhaul: The integration test suite has been refactored to be significantly more robust and maintainable. The previous shared-state, table-driven test has been replaced with a testHarness structure that provides full resource and state isolation for each test case, eliminating flakes and improving clarity.

Which issue(s) this PR fixes:
Fixes #1624
Fixes #1626

Does this PR introduce a user-facing change?:

- Fixes a critical bug where token usage metrics for streaming responses were undercounted and recorded with a missing `model_name` label.
- Fixes a bug that could cause the gateway to buffer streaming responses incorrectly.
- Improves the robustness of header parsing to prevent dropped `x-request-id` headers, ensuring distributed traces remain intact.

This commit addresses several critical bugs discovered in the External
Processing Proxy (EPP) that impact reliability, observability, and
correctness, particularly for streaming use cases.

**Bug Fixes:**

1.  **Correct Streaming Token Metrics:**
    - **Problem:** For streaming responses (e.g., `text/event-stream`),
      token usage metrics were recorded incorrectly. The logic only
      inspected the final `[DONE]` message for a `usage` block, failing
      to accumulate token counts from earlier messages in the stream.
      Additionally, the `IncomingModelName` was being overwritten by a
      blank value from the request body, causing the `model_name` label
      in Prometheus to be empty.
    - **Fix:** The response handler now correctly accumulates token
      counts from all streaming chunks into the `RequestContext`. The
      final, accurate count is recorded only when the `[DONE]` message
      is received. The request handler logic was reordered to ensure
      headers (containing the model name) are always processed before
      the body, preventing the context from being corrupted.
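
    The capture-then-report flow described above can be sketched as follows. This is a simplified model: `Usage`, `RequestContext`, `onStreamChunk`, and `onStreamEnd` are illustrative names, not the EPP's actual types.

    ```go
    package main

    import "fmt"

    // Usage mirrors the token-usage block of an OpenAI-compatible streaming
    // response. Field names here are illustrative assumptions.
    type Usage struct {
    	PromptTokens     int
    	CompletionTokens int
    	TotalTokens      int
    }

    // RequestContext is a simplified stand-in for the EPP's per-request state.
    type RequestContext struct {
    	Usage Usage
    }

    // onStreamChunk captures the usage block from the current chunk, if any.
    // Chunks without a usage block leave the stored counts untouched, so the
    // chunk that carries the block (typically just before [DONE]) wins.
    func (rc *RequestContext) onStreamChunk(u *Usage) {
    	if u != nil {
    		rc.Usage = *u
    	}
    }

    // onStreamEnd returns the captured counts once [DONE] arrives; this is
    // the single point where the metric should be recorded.
    func (rc *RequestContext) onStreamEnd() Usage {
    	return rc.Usage
    }

    func main() {
    	rc := &RequestContext{}
    	rc.onStreamChunk(nil) // early content chunks carry no usage block
    	rc.onStreamChunk(&Usage{PromptTokens: 7, CompletionTokens: 10, TotalTokens: 17})
    	rc.onStreamChunk(nil) // the final [DONE] message is empty
    	fmt.Println(rc.onStreamEnd().TotalTokens) // prints 17
    }
    ```

    The key property is that an empty trailing chunk can no longer zero out previously captured counts, which is what produced the undercount.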

2.  **Robust Header Parsing:**
    - **Problem:** Multiple locations in the codebase exclusively
      checked the `RawValue` field of an Envoy `HeaderValue` message,
      ignoring the valid `Value` field. This caused failures in
      detecting the `content-type` for streaming and loss of the
      `x-request-id` for tracing if a client sent them in the `Value`
      field.
    - **Fix:** All header parsing logic has been updated to check both
      `RawValue` and `Value`, making it robust and compliant with the
      Envoy API.
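
    The dual-field check might look like the following sketch, using a local stand-in for Envoy's `HeaderValue` message (the real type lives in the Envoy `config.core.v3` protos):

    ```go
    package main

    import "fmt"

    // HeaderValue is a local stand-in for Envoy's config.core.v3 HeaderValue
    // message, which carries the value in either RawValue (bytes) or Value
    // (string).
    type HeaderValue struct {
    	Key      string
    	Value    string
    	RawValue []byte
    }

    // headerValue returns the header's value regardless of which field the
    // client populated, preferring RawValue when both are set.
    func headerValue(h *HeaderValue) string {
    	if len(h.RawValue) > 0 {
    		return string(h.RawValue)
    	}
    	return h.Value
    }

    func main() {
    	byRaw := &HeaderValue{Key: "content-type", RawValue: []byte("text/event-stream")}
    	byVal := &HeaderValue{Key: "content-type", Value: "text/event-stream"}
    	// Both forms must resolve to the same value for streaming detection.
    	fmt.Println(headerValue(byRaw) == headerValue(byVal)) // prints true
    }
    ```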

**Refactoring:**

- **Hermetic Test Overhaul:** The integration test suite in
  `test/integration/epp/hermetic_test.go` has been completely refactored
  for reliability and clarity.
- The old, monolithic, table-driven test has been replaced with a
  `testHarness` structure that provides each test case with its own
   isolated server instance, Kubernetes resources (scoped by a unique
   label), and gRPC client.
- This eliminates test interference and makes the suite significantly
  more stable and maintainable. While true parallelism is still blocked
  by a global metrics registry in controller-runtime, this change
  achieves full resource and state isolation.
- Test cases are now grouped by functionality (`RequestRouting`,
  `ResponseHandling`, etc.) with clear, descriptive names.
- All associated test utilities and documentation have been polished to
  improve readability and maintainability.
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 20, 2025
@k8s-ci-robot
Contributor

@LukeAVanDrie: The label(s) kind/test cannot be applied, because the repository doesn't have them.

In response to this:

What type of PR is this?

/kind bug
/kind test
/kind cleanup


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign nirrozenbaum for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 20, 2025
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 20, 2025

netlify bot commented Sep 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 5fd9820
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68ce1ca931d43f0008b9d04f
😎 Deploy Preview https://deploy-preview-1627--gateway-api-inference-extension.netlify.app

requests = append(requests, headerReq)
requests = append(requests, GenerateRequest(logger, prompt, model, filterMetadata))
// Simulate a multi-chunk body by splitting the marshaled JSON.
Contributor Author

This is a small improvement to better reflect actual streaming use cases.

inference_objective_input_tokens_bucket{model_name="",target_model_name="",le="+Inf"} 1
inference_objective_input_tokens_sum{model_name="",target_model_name=""} 7
inference_objective_input_tokens_count{model_name="",target_model_name=""} 1
inference_objective_input_tokens_bucket{model_name="sql-lora-sheddable",target_model_name="sql-lora-1fdg3",le="1"} 0
Contributor Author

Verification that the labels are being applied properly now.

@LukeAVanDrie
Contributor Author

Here is a mapping of the original test cases in HEAD to the new, refactored test cases to verify that the refactoring of the hermetic test suite was not lossy. Hopefully this helps with reviewing as the delta is quite large.

All test scenarios are preserved. The only change in assertions is the intentional correction of the streaming metrics bug.

| Original Test Case (HEAD) | Refactored Test Case (This PR) | Analysis |
| --- | --- | --- |
| **Request Routing & Scheduling Scenarios** | | |
| select lower queue and kv cache, no active lora | `RequestRouting/selects pod with lower queue and kv cache` | Coverage Preserved. Assertions are identical. |
| select active lora, low queue | `RequestRouting/selects pod with active lora and low queue` | Coverage Preserved. Assertions are identical. |
| select lora despite higher kv cache usage | `RequestRouting/selects pod with lora affinity despite higher kv cache` | Coverage Preserved. Assertions are identical. |
| don't shed requests by default | `RequestRouting/routes to least-saturated pod when all pods are under high load` | Coverage Preserved. This tests the same "no shedding" behavior by ensuring the least-bad pod is chosen. Assertions are equivalent. |
| body sent over multiple requests, noncritical... | `RequestRouting/routes request with multi-chunk body` | Coverage Preserved. The core scenario (routing a request with a body split across multiple gRPC messages) is identical. |
| inferenceobjective's modelName is not translated, passthrough | `RequestRouting/passthrough for models not defined in objectives` | Coverage Preserved. Assertions are identical. |
| **Response Handling Scenarios** | | |
| responsebody sent over multiple requests, content-type is json, buffer | `ResponseHandling/buffers and rewrites multi-chunk json response` | Coverage Preserved. Assertions are identical. |
| Response is invalid json; return body | `ResponseHandling/handles invalid json in response body` | Coverage Preserved. Assertions are identical. |
| responsebody sent over a single request, but empty body with EndOfStream... (1st) | `ResponseHandling/handles single chunk response followed by empty EOS chunk` | Coverage Preserved. This correctly tests Envoy's behavior of sending an empty final chunk. Assertions are identical. |
| responsebody sent over a single request, but empty body with EndOfStream... (2nd) | `ResponseHandling/passes through and counts tokens in event-stream response` | Coverage Intentionally Changed (Bug Fix). The old test asserted the buggy behavior (empty `model_name` label). The new test asserts the correct, fully-populated labels. |
| **Subsetting Scenarios** | | |
| select active lora with subsetting tag, all pods available | `Subsetting/selects best pod from available subset` | Coverage Preserved. Assertions are identical. |
| select active lora with subsetting tag, some pods match | `Subsetting/selects only available pod in subset despite high load` | Coverage Preserved. Assertions are identical. |
| select active lora with subsetting tag, no pods available | `Subsetting/returns error when no pods match subset` | Coverage Preserved. Assertions are identical. |
| **Error Condition Scenarios** | | |
| invalid json; return body (request-side) | `ErrorConditions/invalid json in request body` | Coverage Preserved. Assertions are identical. |
| request don't contains invalid payload, model not exist | `ErrorConditions/request body is missing model field` | Coverage Preserved. Assertions are identical. |
| no backend pods are available | `ErrorConditions/no backend pods available in datastore` | Coverage Preserved. Assertions are identical. |
| **Request Type Scenarios** | | |
| simple GET Request | `RequestTypes/simple GET request is passed through` | Coverage Preserved. Assertions are identical. |

@ahg-g
Contributor

ahg-g commented Sep 20, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 20, 2025
@k8s-ci-robot
Contributor

@LukeAVanDrie: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gateway-api-inference-extension-test-unit-main 5fd9820 link true /test pull-gateway-api-inference-extension-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


func (s *StreamingServer) HandleRequestHeaders(reqCtx *RequestContext, req *extProcPb.ProcessingRequest_RequestHeaders) error {
reqCtx.RequestReceivedTimestamp = time.Now()

// Headers must be processed first to populate the request context as subsequent logic (like body processing) may
Contributor

I am not following, where in this function do we use the reqCtx that would require moving this logic earlier? it is fine to move it up, but wanted to double check where the bug is in this specific case.

// depend upon it.
for _, header := range req.RequestHeaders.Headers.Headers {
key := header.Key
// Per the Envoy API, a header's value can be in either the `RawValue` (bytes) or `Value` (string) field.
Contributor

should we fix ExtractHeaderValue then? see https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/util/request/headers.go#L35;

Probably best if we have a single function for the if/else logic used across the code for reading the header value. Name it ExtractHeaderValue, and rename the current ExtractHeaderValue to ExtractCaseInsensitiveHeaderValue.
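
The split proposed here could look roughly like the sketch below, with the two function names taken from the suggestion and a local struct standing in for the Envoy proto type (the real helper in `pkg/epp/util/request/headers.go` may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// HeaderValue stands in for Envoy's config.core.v3 HeaderValue proto.
type HeaderValue struct {
	Key      string
	Value    string
	RawValue []byte
}

// ExtractHeaderValue returns a header's value from whichever field is
// populated, preferring RawValue, matching Envoy's semantics.
func ExtractHeaderValue(h *HeaderValue) string {
	if len(h.RawValue) > 0 {
		return string(h.RawValue)
	}
	return h.Value
}

// ExtractCaseInsensitiveHeaderValue finds a header by name, ignoring
// case, and resolves its value through ExtractHeaderValue.
func ExtractCaseInsensitiveHeaderValue(headers []*HeaderValue, name string) (string, bool) {
	for _, h := range headers {
		if strings.EqualFold(h.Key, name) {
			return ExtractHeaderValue(h), true
		}
	}
	return "", false
}

func main() {
	hdrs := []*HeaderValue{{Key: "X-Request-Id", RawValue: []byte("abc-123")}}
	v, ok := ExtractCaseInsensitiveHeaderValue(hdrs, "x-request-id")
	fmt.Println(v, ok) // prints: abc-123 true
}
```

With this shape, every call site reads header values through one code path, so the RawValue/Value handling cannot drift between locations.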

Contributor Author

Yes, this is a much cleaner solution, thanks! I missed this helper as I was tracing the request path. Most of these were discovered through hermetic test failures post-refactoring, and I am not deeply familiar with this part of the codebase. I will consolidate this into the helper.

reqCtx.Request.Headers[requtil.RequestIdHeaderKey] = requestID // update in headers so director can consume it
}
// Ensure the request ID, whether pre-existing or newly generated, is in the context's header map.
// This makes it available to all downstream logic (e.g., the director).
Contributor

Trying to understand if this is really a bug: in the case where the header already existed, HandleRequestHeaders would have set it in reqCtx.Request.Headers, right? I guess the only difference here is that we are unifying the header key as requtil.RequestIdHeaderKey, since at line 187 the header could have had a different casing.

Contributor Author

since at line 187 the header could have had a different casing

Let me look into this more. You may be onto something here. Perhaps there is a simpler solution / root cause for the behaviors I was seeing.

// Parse the current chunk for a 'usage' block.
resp := parseRespForUsage(ctx, responseText)

// Accumulate token counts.
Contributor

Two comments:

  1. From a metrics perspective, we are not truly accumulating anything; this is just reporting the values in the last block.

  2. Which logic depends on reqCtx.Usage while we are accumulating the stream, such that we would need to keep updating the usage rather than just setting it at the end?

Contributor Author

@LukeAVanDrie Sep 21, 2025

The issue here was that the 'usage' block was not coming in the final "[DONE]" message in the hermetic testing environment. It was arriving in the penultimate message instead.

When processing the final message of the stream, it was empty, resulting in underreporting and empty labels. I need to do some more investigating to understand if this is an issue with the hermetic testing environment or something that can also occur with real traffic.

From a metrics perspective, we are not truly accumulating anything; this is just reporting the values in the last block.

True, we just need to find the block that populates it. My comment is misleading as this is still a "detect and write once" process that is only reported after the stream ends.

Contributor Author

@LukeAVanDrie Sep 21, 2025

I was trying to understand and debug behavior from integration test results, and I am not too familiar with our request stream handling code and the properties we can rely upon in production. It's very possible my root cause analysis is wrong in places, but the symptoms I reported in the linked issues are reproducible.

Let me write some targeted unit tests over the relevant components to better understand this.

Contributor

The issue here was that the 'usage' block was not coming in the final "[DONE]" message in the hermetic testing environment. It was arriving in the penultimate message instead.

This might be an issue in the hermetic test. In real traffic, the penultimate and last messages come in the same chunk:

data: {"id":"...","object":"text_completion","created":1739400043,"model":"food-review-0","choices":[], "usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}}

data: [DONE]
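
Given that on-the-wire shape, a parser that scans every `data:` line in a chunk will pick up the usage block even when it shares a chunk with the `[DONE]` terminator. A sketch under that assumption (not the EPP's actual parser; types here are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

type sseEvent struct {
	Usage *usage `json:"usage"`
}

// parseChunk scans every "data:" line in one SSE chunk. It returns the
// last usage block seen and whether the [DONE] terminator was present,
// handling the case where both arrive in the same chunk.
func parseChunk(chunk string) (*usage, bool) {
	var u *usage
	done := false
	for _, raw := range strings.Split(chunk, "\n") {
		line := strings.TrimSpace(raw)
		if !strings.HasPrefix(line, "data:") {
			continue
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload == "[DONE]" {
			done = true
			continue
		}
		var ev sseEvent
		if err := json.Unmarshal([]byte(payload), &ev); err == nil && ev.Usage != nil {
			u = ev.Usage
		}
	}
	return u, done
}

func main() {
	chunk := "data: {\"choices\":[], \"usage\":{\"prompt_tokens\":7,\"total_tokens\":17,\"completion_tokens\":10}}\n\ndata: [DONE]\n"
	u, done := parseChunk(chunk)
	fmt.Println(u.TotalTokens, done) // prints: 17 true
}
```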

Contributor Author

Great, thanks for confirming! I will fix the hermetic test setup to conform to this.

I think #1626 can be closed as WAI. I also reported an issue where metric labels were not being set properly, so I need to first confirm that this is solely triggered by the data: [DONE] final chunk case. If so, I'll revert my change to this response handling code.

Comment on lines +40 to +44
if header.RawValue != nil {
reqCtx.Request.Headers[key] = string(header.RawValue)
} else {
reqCtx.Request.Headers[key] = header.Value
}
Contributor

@nirrozenbaum Sep 21, 2025

continuing @ahg-g line of thinking - I agree we should fix the ExtractHeaderValue helper func and then we can use it here, e.g.,:

reqCtx.Request.Headers[key] = ExtractHeaderValue(key)

I think the logic that handles RawValue or Value should be scoped to a single helper function.

Comment on lines +246 to +251
var value string
if len(header.RawValue) > 0 {
value = string(header.RawValue)
} else {
value = header.Value
}
Contributor

use the single helper function?
RawValue/Value should be encapsulated to ExtractHeaderValue helper func.

Comment on lines +35 to +39
// Per the Envoy API, a header's value can be in either the `RawValue` (bytes) or `Value` (string) field.
if len(headerKv.RawValue) > 0 {
return string(headerKv.RawValue)
}
return headerKv.Value
Contributor

ditto

// GetFreePort finds and returns an available TCP port on the host.
// It works by asking the OS to allocate a port by listening on port 0, capturing the assigned address, and then
// immediately closing the listener.
func GetFreePort() (*net.TCPAddr, error) {
Contributor

this is great
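
For reference, a common implementation of the pattern the doc comment describes looks like the sketch below (the repository's version may differ in detail):

```go
package main

import (
	"fmt"
	"net"
)

// GetFreePort finds and returns an available TCP port on the host.
// It asks the OS to allocate a port by listening on port 0, captures
// the assigned address, and closes the listener before returning.
func GetFreePort() (*net.TCPAddr, error) {
	l, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		return nil, err
	}
	defer l.Close()
	addr, ok := l.Addr().(*net.TCPAddr)
	if !ok {
		return nil, fmt.Errorf("unexpected address type %T", l.Addr())
	}
	return addr, nil
}

func main() {
	addr, err := GetFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println(addr.Port > 0) // prints true
}
```

Note the inherent race: the port can in principle be reclaimed between closing the listener and the test server binding it, which is generally acceptable for test setup.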

@LukeAVanDrie
Contributor Author

--- FAIL: TestFullDuplexStreamed_KubeInferenceObjectiveRequest (8.13s)
    --- FAIL: TestFullDuplexStreamed_KubeInferenceObjectiveRequest/ResponseHandling (0.67s)
        --- FAIL: TestFullDuplexStreamed_KubeInferenceObjectiveRequest/ResponseHandling/buffers_and_rewrites_multi-chunk_json_response (0.19s)
            util.go:86: Sending request: request_headers:{headers:{headers:{key:"x-gateway-inference-objective"  value:"sql-lora-sheddable"}  headers:{key:"x-gateway-model-name-rewrite"  value:"sql-lora-1fdg3"}  headers:{key:"x-request-id"  value:"test-static-id-1"}}  end_of_stream:true}
            util.go:86: Sending request: response_headers:{headers:{headers:{key:"content-type"  value:"application/json"}}}
            util.go:86: Sending request: response_body:{body:"{\"model\":\"sql-lora-sheddable\", \"prompt\": \"test\"}"}
            util.go:86: Sending request: response_body:{body:"}"  end_of_stream:true}
            util.go:118: Received response request_headers:{response:{header_mutation:{set_headers:{header:{key:"x-gateway-destination-endpoint"  raw_value:"192.168.1.1:8000"}}  set_headers:{header:{key:"x-request-id"  raw_value:"test-static-id-1"}}}  clear_route_cache:true}}  dynamic_metadata:{fields:{key:"envoy.lb"  value:{struct_value:{fields:{key:"x-gateway-destination-endpoint"  value:{string_value:"192.168.1.1:8000"}}}}}}
            util.go:118: Received response response_headers:{response:{header_mutation:{set_headers:{header:{key:"x-went-into-resp-headers"  raw_value:"true"}}  set_headers:{header:{key:"content-type"  raw_value:"application/json"}}}}}
            util.go:118: Received response response_body:{response:{body_mutation:{streamed_response:{body:"{\"model\":\"sql-lora-sheddable\", \"prompt\": \"test\"}}"  end_of_stream:true}}}}
            hermetic_test.go:947: Unexpected response, (-want +got):   []*ext_procv3.ProcessingResponse{
                  	Inverse(protocmp.Transform, protocmp.Message{"@type": s"envoy.service.ext_proc.v3.ProcessingResponse", "dynamic_metadata": protocmp.Message{"@type": s"google.protobuf.Struct", "fields": map[string]protocmp.Message{"envoy.lb": {"@type": s"google.protobuf.Value", "struct_value": protocmp.Message{"@type": s"google.protobuf.Struct", "fields": map[string]protocmp.Message{"x-gateway-destination-endpoint": {"@type": s"google.protobuf.Value", "string_value": string("192.168.1.1:8000")}}}}}}, "request_headers": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeadersResponse", "response": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.CommonResponse", "clear_route_cache": bool(true), "header_mutation": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeaderMutation", "set_headers": []protocmp.Message{{"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("x-gateway-destination-endpoint"), "raw_value": []uint8("192.168.1.1:8000")}}, {"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("x-request-id"), "raw_value": []uint8("test-static-id-1")}}}}}}}),
                  	Inverse(protocmp.Transform, protocmp.Message{"@type": s"envoy.service.ext_proc.v3.ProcessingResponse", "response_headers": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeadersResponse", "response": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.CommonResponse", "header_mutation": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeaderMutation", "set_headers": []protocmp.Message(Inverse(cmpopts.SortSlices, []protocmp.Message{{"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("content-type"), "raw_value": []uint8("application/json")}}, {"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("x-went-into-resp-headers"), "raw_value": []uint8("true")}}}))}}}}),
                  	Inverse(protocmp.Transform, protocmp.Message{
                  		"@type": s"envoy.service.ext_proc.v3.ProcessingResponse",
                  		"response_body": protocmp.Message{
                  			"@type": s"envoy.service.ext_proc.v3.BodyResponse",
                  			"response": protocmp.Message{
                  				"@type": s"envoy.service.ext_proc.v3.CommonResponse",
                  				"body_mutation": protocmp.Message{
                  					"@type": s"envoy.service.ext_proc.v3.BodyMutation",
                  					"streamed_response": protocmp.Message{
                  						"@type": s"envoy.service.ext_proc.v3.StreamedBodyResponse",
                  						"body": bytes.Join({
                  							`{"model":"sql-lora-sheddable",`,
                + 							" ",
                  							`"prompt":`,
                + 							" ",
                  							`"test"`,
                + 							"}",
                  							"}",
                  						}, ""),
                  						"end_of_stream": bool(true),
                  					},
                  				},
                  			},
                  		},
                  	}),
                  }
FAIL
coverage: [no statements]
FAIL	sigs.k8s.io/gateway-api-inference-extension/test/integration/epp	15.606s
	sigs.k8s.io/gateway-api-inference-extension/test/utils		coverage: 0.0% of statements
?   	sigs.k8s.io/gateway-api-inference-extension/version	[no test files]
FAIL
make: *** [Makefile:137: test] Error 1

This passes when I run make test or make test-integration locally. Interestingly, it fails in the CI/CD environment for what appear to be minor formatting differences in the test expectations. Let me see how I can make this less brittle.

Contributor Author

@LukeAVanDrie Sep 21, 2025

The primary focus of this PR is a significant refactoring of the EPP hermetic test suite to achieve true test isolation. While it also includes important bug fixes that this new test rigor uncovered, the change here was necessary to unblock upcoming integration tests for the Flow Control layer, which require a clean and mutable environment for each test scenario.

Guidance for Reviewers

I would greatly appreciate your focus on the test refactoring itself:

  • Isolation Strategy: Does the combination of unique labels and a scoped cache seem robust and effective for serial execution?
  • Clarity & Maintainability: Is the new testHarness approach clearer and easier to extend?
  • Potential Risks: Do you see any potential downsides to this per-test setup approach?

I just want to make sure this doesn't get overlooked amidst the bug fix discussion as this is the heart of this PR.

Drawbacks of the Previous Approach

Previously, our tests relied on a shared, suite-level setup. The controller-runtime manager's cache was unscoped, meaning it watched all pods in the namespace. This design was fragile and prone to state leakage between test runs, making it difficult to write reliable tests for features that need to mutate backend state (e.g., marking pods as saturated/unsaturated).

The New testHarness

This PR overhauls this approach by introducing a testHarness that provides a fully encapsulated, hermetic environment for each test case.

Here’s a summary of the key isolation mechanisms:

  1. Per-Test Harness: Each test now runs within its own testHarness instance, which encapsulates all necessary components (server, client, manager) and state.
  2. Unique Resource Labeling: A unique test-run-id label (derived from a UUID) is generated for each harness and applied to all Kubernetes pods it creates.
  3. Scoped Kubernetes Cache: This is the most crucial part. The controller-runtime manager within each harness is configured with a cache that only watches pods matching that specific, unique test-run-id label. This prevents the test from seeing or interacting with pods from any other test run.
  4. Dedicated Lifecycle Management: Each harness starts its own server instance and creates its own gRPC client. A robust cleanup function, automatically called by t.Cleanup, uses a context.CancelFunc to gracefully tear down the server, manager, and client, and then deletes the uniquely labeled Kubernetes pods.

A Note on Parallelism

Despite this isolation, tests cannot use t.Parallel() because controller-runtime's metrics registry is process-global. To keep serial execution clean, each test resets the registry during teardown (t.Cleanup(metrics.Reset)). True parallelism would require a per-test envtest instance, which is prohibitively slow, so this refactoring is a pragmatic middle ground: it eliminates resource contention while still sharing the suite-level envtest setup.

Thanks!

@kfswain
Collaborator

kfswain commented Sep 22, 2025

> The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.

IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?

@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Sep 22, 2025

> > The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.
>
> IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?

This may actually be an issue in the test setup then where we are not accurately simulating the streaming use case. If this is never the case in production, I will update the test utils to reflect this and revert this particular change.

I changed the structure of how the tests are run (so they are executed in isolated environments) but not the assertions, so I was chasing down failures and may have root caused this incorrectly.

@ahg-g
Contributor

ahg-g commented Sep 22, 2025

> > > The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.
> >
> > IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?
>
> This may actually be an issue in the test setup then where we are not accurately simulating the streaming use case. If this is never the case in production, I will update the test utils to reflect this and revert this particular change.
>
> I changed the structure of how the tests are run (so they are executed in isolated environments) but not the assertions, so I was chasing down failures and may have root caused this incorrectly.

I don't mind keeping the logic you have if we think it is more "resilient"

@kfswain
Collaborator

kfswain commented Sep 22, 2025

> > > > The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.
> > >
> > > IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?
> >
> > This may actually be an issue in the test setup then where we are not accurately simulating the streaming use case. If this is never the case in production, I will update the test utils to reflect this and revert this particular change.
> >
> > I changed the structure of how the tests are run (so they are executed in isolated environments) but not the assertions, so I was chasing down failures and may have root caused this incorrectly.
>
> I don't mind keeping the logic you have if we think it is more "resilient"

Yes, agreed, that's fine.

I'm just walking through a risk assessment here, weighing the risk of patching in new logic against the risk of leaving these bugs in place.

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
6 participants